VirtualEATING

Andrew Lane, University of California, Berkeley

Overview

CRISPR-EATING is a molecular biology protocol for generating libraries of CRISPR guide RNAs. Using this approach to generate a library suitable for imaging a chromosomal locus requires avoiding regions that would be processed into non-specific guides, which (in part) is what these scripts are designed to achieve.

These scripts contain a set of functions that are assembled into a workflow to:

  • Predict the sgRNA spacers produced when a particular substrate DNA is subjected to the EATING protocol described in Lane et al., Dev. Cell (2015).
  • Score those predicted guides for specificity within a genome, using a BLAST database built from that genome and an implementation of the CRISPR guide scoring algorithm described in Hsu et al. (2013).
  • Using the score information, pick out sub-regions within the substrate DNA that will produce clusters of high-specificity guides, and design PCR primers to amplify only those regions.

Following the generation of suitable PCR primers from this tool, the "wet" portion of the protocol is as follows:

  1. The output PCR primers (144 pairs in 144 separate reactions in the case of the labeled 3MB region) are used to amplify from the substrate DNA.
  2. The resulting products are pooled and subjected to the EATING molecular biology protocol.
  3. When complexed to dCas9-mNeonGreen (or other fluorescent protein), the resulting library can be used to image your desired locus.

Prerequisites

  1. Some experience with Python and the very basics of Biopython and BLAST
  2. A Python installation with Biopython (the scripts use cPickle and other Python 2 idioms, so they target Python 2; pickle itself is part of the standard library)
  3. A BLAST database generated from the genome against which you would like to score your guides and a working BLAST installation. To generate the BLAST database, use a FASTA file containing your genome of interest. For example, LAEVIS_7.1.repeatmasked.fa. Use the following syntax to generate the BLAST DB. (The -parse_seqids flag is critical; the guide scoring algorithm expects a database generated using this flag).
     makeblastdb -in LAEVIS_7.1.repeatmasked.fa -dbtype nucl -parse_seqids -out xl71 -title 'xl71'
    
    This was tested using makeblastdb version 2.2.29+. Perform a test BLAST query on your database to check that your installation can find it.
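    For example (a minimal check; test.fa stands in for any small FASTA file containing a guide-length sequence from your genome, and the blastn-short task is appropriate for ~20 bp queries):
     blastn -db xl71 -query test.fa -task blastn-short -outfmt 6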
  4. The original FASTA file used to make the BLAST database must also be available; this is needed to determine whether a guide's BLAST hit is adjacent to a PAM and therefore relevant for score determination. The current implementation loads the entire genome into memory, so you need a computer with enough RAM (8-16GB) for this; a possible lower-memory alternative is sketched below. (Future updates may remove this requirement.)
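
If RAM is limiting, Bio.SeqIO.index may serve as a lower-memory alternative: it provides dict-like, read-only access to the FASTA records and parses them lazily from disk. A minimal sketch (untested with these scripts; it assumes the scoring code only performs dict-style lookups on the genome):


In [ ]:
from Bio import SeqIO
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA
# Lazily-indexed, dict-like access to the genome, keyed by record id
xl71genomedict = SeqIO.index("../../Genomic Data/LAEVIS_7.1.repeatMasked.fa", "fasta",
                             alphabet=IUPACAmbiguousDNA())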

References:

Hsu PD, Scott DA, Weinstein JA, Ran FA, Konermann S, Agarwala V, et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat Biotechnol. 2013;31: 827–832. doi:10.1038/nbt.2647

Using this notebook and adapting it for a particular purpose

The basic EATING-related logic is in the eating module (eating.py). This module contains functions (prefixed with "al_") to predict the guides that will be generated from an input DNA sequence and score the guides.

Import modules


In [1]:
import Bio 
from Bio.Blast.Applications import NcbiblastnCommandline
from Bio import SeqIO
from Bio.Blast import NCBIXML
from Bio import Restriction 
from Bio.Restriction import *
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import cPickle as pickle
import subprocess
import matplotlib
from eating import *
import multiprocessing as mp
from operator import itemgetter, attrgetter, methodcaller
import numpy

%pylab inline


Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['power', 'linalg', 'draw_if_interactive', 'random', 'fft', 'info']
`%matplotlib` prevents importing * from pylab and numpy

Set up input files

1. The FASTA file to be scored


In [2]:
fasta_file = SeqIO.parse("../../Genomic Data/LAEVIS_7.1.repeatMasked.fa", "fasta")

2. The genome against which generated guides are scored

See Prerequisites above for explanation.


In [3]:
handle = open("../../Genomic Data/LAEVIS_7.1.repeatMasked.fa", 'rb')
xl71genome = SeqIO.parse(handle, "fasta", alphabet=IUPACAmbiguousDNA())
xl71genomedict = {}
for item in xl71genome:
    xl71genomedict[item.id] = item

In [4]:
len(xl71genomedict)


Out[4]:
410604

Begin custom processing

The FASTA file we've loaded (fasta_file) contains the entire X. laevis genome. The X. laevis genome hasn't yet been definitively assembled into physical chromosomes - instead, it's a large number of contigs or "scaffolds". For the purposes of making a library that labels a single region, we want to work with a big piece that we know is contiguous. So, we find the longest "scaffold".


In [5]:
longest = 0
for item in fasta_file:
    if len(item) > longest:
        longest = len(item)
        longscaffold = [item]

In [6]:
print(longscaffold[0].name + " is the longest scaffold at " + "{:,}".format(len(longscaffold[0])) + " bp in length.")


Scaffold102974 is the longest scaffold at 21,560,636 bp in length.

Next, we want to digest this scaffold into guides. This uses the al_digesttarget function in eating.py to produce a generator of guide SeqRecords.


In [7]:
cutslist = al_digesttarget(longscaffold)

In this version of the script, the output from al_digesttarget isn't especially readable, but for reference:

  • Each item is a SeqRecord (see BioPython docs)
  • The Sequence is the guide 20mer, written from 5' to 3'
  • The ID is the index of the cut fragment that an enzyme produces, counting from the left (i.e. the most 5' guide has an id of 1), plus the strand the guide is found on (F or R, where F is forward with respect to the input DNA). IDs run through all the HpaII cuts first, then all the BfaI cuts, then all the ScrFI cuts. Note that the script predicts the results when each digestion is done in a separate tube, rather than when all enzymes are used as a mixture (which would eliminate some guides where cut sites of two different enzymes are <20 bp apart).
  • The name is the sequence position of the left edge of the guide along the input DNA. For forward-direction guides, this is position of the 5' end of the guide. For reverse, it's position of the 3' end of the guide.
  • The description is the enzyme that generates the guide's cut site.

In this example (the most 5' 1500 bp of the chosen scaffold), HpaII does not cut. Note that each enzyme's recognition site is palindromic and contains a PAM on both strands; this results in a guide being generated on each side of the cut site.
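
For reference, the recognition sites and cut positions of the three enzymes can be checked directly with Biopython's Restriction module (a small illustrative sketch; in elucidate()'s output, "^" marks the top-strand cut and "_" the bottom-strand cut):


In [ ]:
from Bio.Restriction import HpaII, BfaI, ScrFI
for enz in (HpaII, BfaI, ScrFI):
    print("%s: site %s, cuts %s" % (enz, enz.site, enz.elucidate()))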


In [8]:
[item for item in al_digesttarget([longscaffold[0][0:1500]])]


Out[8]:
[SeqRecord(seq=Seq('GCGCTGGCCAGAACGTTCTC', SingleLetterAlphabet()), id='1_F', name='788', description='BfaI', dbxrefs=['Scaffold102974']),
 SeqRecord(seq=Seq('AATGTCTTCTCCACGATTCC', SingleLetterAlphabet()), id='2_R', name='810', description='BfaI', dbxrefs=['Scaffold102974']),
 SeqRecord(seq=Seq('TAAAGGAGAAGGAAACCCCC', SingleLetterAlphabet()), id='2_F', name='1444', description='BfaI', dbxrefs=['Scaffold102974']),
 SeqRecord(seq=Seq('AGGGGAGGGATTTTTGTGCC', SingleLetterAlphabet()), id='3_R', name='1465', description='BfaI', dbxrefs=['Scaffold102974']),
 SeqRecord(seq=Seq('GGAGGGACAGCAGCTGGGCC', SingleLetterAlphabet()), id='4_F', name='734', description='ScrFI', dbxrefs=['Scaffold102974']),
 SeqRecord(seq=Seq('ATCTTTCTTGCCCCCCCCCC', SingleLetterAlphabet()), id='5_R', name='754', description='ScrFI', dbxrefs=['Scaffold102974'])]

Next, we'd like to take the 20mers extracted and score them against the entire Xenopus laevis genome. These lines score each guide variable region for specificity using the xl71 BLAST database and the xl71genomedict dict.


In [ ]:
def multiscore_pool(x):
    score = al_scoreguide(x, "xl71", xl71genomedict)
    return (score[0], score[1])

In [11]:
# Parallelization approach after http://sebastianraschka.com/Articles/2014_multiprocessing_intro.html#An-introduction-to-parallel-programming-using-Python's-multiprocessing-module
pool = mp.Pool(processes=2)
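# Note: pool.apply blocks until each call returns, so this loop runs serially;
# pool.map(multiscore_pool, cutslist) would distribute the work across processes.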
results = [pool.apply(multiscore_pool, args=(x,)) for x in cutslist]
pickle.dump(results, open( "finalpicklescores.pkl", "wb" ))
pool.close()
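
Scoring every guide against the genome is slow, so the results are pickled above; a later session can reload them and skip the BLAST step entirely (assuming the pickle was written as in the previous cell):


In [ ]:
with open("finalpicklescores.pkl", "rb") as f:
    results = pickle.load(f)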

The format of the resulting data is (score, guide).


In [53]:
results[0:20]


Out[53]:
[(15,
  SeqRecord(seq=Seq('TGCGCAATAGGAGCATTTAC', SingleLetterAlphabet()), id='0_F', name='1639', description='HpaII', dbxrefs=['Scaffold102974'])),
 (2,
  SeqRecord(seq=Seq('CATGCGCAGTAAATCCGTAC', SingleLetterAlphabet()), id='1_R', name='1661', description='HpaII', dbxrefs=['Scaffold102974'])),
 (3,
  SeqRecord(seq=Seq('TCTGTCTCCTACCCCATGTC', SingleLetterAlphabet()), id='1_F', name='1785', description='HpaII', dbxrefs=['Scaffold102974'])),
 (8,
  SeqRecord(seq=Seq('CAGTCGGGGAGACACGGGAC', SingleLetterAlphabet()), id='2_R', name='1807', description='HpaII', dbxrefs=['Scaffold102974'])),
 (7,
  SeqRecord(seq=Seq('CGCTCGAGCACGCACTCAGC', SingleLetterAlphabet()), id='2_F', name='1889', description='HpaII', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('AATCTGGCCAGCCACATCAC', SingleLetterAlphabet()), id='3_R', name='1911', description='HpaII', dbxrefs=['Scaffold102974'])),
 (91,
  SeqRecord(seq=Seq('GATTGCCTAGGGCATTTGGC', SingleLetterAlphabet()), id='3_F', name='1927', description='HpaII', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('GGCCAATC', SingleLetterAlphabet()), id='4_R', name='1949', description='HpaII', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('CGGATTGGCC', SingleLetterAlphabet()), id='4_F', name='1937', description='HpaII', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('GGGATTGGTGCACACGGAGC', SingleLetterAlphabet()), id='5_R', name='1959', description='HpaII', dbxrefs=['Scaffold102974'])),
 (6,
  SeqRecord(seq=Seq('GGGCGCCTGATTTGTAAATC', SingleLetterAlphabet()), id='5_F', name='2064', description='HpaII', dbxrefs=['Scaffold102974'])),
 (99,
  SeqRecord(seq=Seq('AAAGGGGGTTTATTCAGGGC', SingleLetterAlphabet()), id='6_R', name='2086', description='HpaII', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('CATGTACCATTGTATCGAAC', SingleLetterAlphabet()), id='6_F', name='2277', description='HpaII', dbxrefs=['Scaffold102974'])),
 (15,
  SeqRecord(seq=Seq('GCACATTTGGAGCATTTTCC', SingleLetterAlphabet()), id='7_R', name='2299', description='HpaII', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('TTCCACTCTGCGTATGTCCC', SingleLetterAlphabet()), id='7_F', name='3233', description='HpaII', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('AACTGAAATAGACACAGCGC', SingleLetterAlphabet()), id='8_R', name='3255', description='HpaII', dbxrefs=['Scaffold102974'])),
 (62,
  SeqRecord(seq=Seq('ATAGTTTCCTCCCTAAAGGC', SingleLetterAlphabet()), id='8_F', name='3404', description='HpaII', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('ATTGGAGTCAATAATTGGAC', SingleLetterAlphabet()), id='9_R', name='3426', description='HpaII', dbxrefs=['Scaffold102974'])),
 (95,
  SeqRecord(seq=Seq('GAAAGGGCTGATTCCATCAC', SingleLetterAlphabet()), id='9_F', name='3490', description='HpaII', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('AAGAGAAGCCTATTAAGAGC', SingleLetterAlphabet()), id='10_R', name='3512', description='HpaII', dbxrefs=['Scaffold102974']))]

The scores in this object are an ordered list: all HpaII scores first, then all BfaI scores, and finally all ScrFI scores. We are interested in the distribution of scores along the DNA fragment, irrespective of the enzyme that generated them, so we rearrange the list to put all guides in positional order, 5' to 3'.


In [93]:
a = []
for (score, details) in results:
    a.append(int(details.name))  # The guide's name attribute contains its position in bp
resultssorted = zip(results, a)
resultssorted = sorted(resultssorted, key=itemgetter(1), reverse=False)
resultssorted = [item for item, null in resultssorted]

In [125]:
resultssorted[:5]


Out[125]:
[(94,
  SeqRecord(seq=Seq('GGAGGGACAGCAGCTGGGCC', SingleLetterAlphabet()), id='66855_F', name='734', description='ScrFI', dbxrefs=['Scaffold102974'])),
 (64,
  SeqRecord(seq=Seq('ATCTTTCTTGCCCCCCCCCC', SingleLetterAlphabet()), id='66856_R', name='755', description='ScrFI', dbxrefs=['Scaffold102974'])),
 (66,
  SeqRecord(seq=Seq('GCGCTGGCCAGAACGTTCTC', SingleLetterAlphabet()), id='20037_F', name='788', description='BfaI', dbxrefs=['Scaffold102974'])),
 (64,
  SeqRecord(seq=Seq('AATGTCTTCTCCACGATTCC', SingleLetterAlphabet()), id='20038_R', name='810', description='BfaI', dbxrefs=['Scaffold102974'])),
 (2,
  SeqRecord(seq=Seq('TAAAGGAGAAGGAAACCCCC', SingleLetterAlphabet()), id='20038_F', name='1444', description='BfaI', dbxrefs=['Scaffold102974']))]

In [126]:
resultssorted[-5:]


Out[126]:
[(100,
  SeqRecord(seq=Seq('GCACAATATGTTCAAATTGC', SingleLetterAlphabet()), id='66852_R', name='21559690', description='BfaI', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('GAATTGCAATTTGCGATGTC', SingleLetterAlphabet()), id='66852_F', name='21559782', description='BfaI', dbxrefs=['Scaffold102974'])),
 (0,
  SeqRecord(seq=Seq('TTTGAAATCTGACATGGGGC', SingleLetterAlphabet()), id='66853_R', name='21559804', description='BfaI', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('CTGTAACTGTCGCATGGATC', SingleLetterAlphabet()), id='66853_F', name='21560446', description='BfaI', dbxrefs=['Scaffold102974'])),
 (100,
  SeqRecord(seq=Seq('GATCCCATTCAGTTTATTTC', SingleLetterAlphabet()), id='66854_R', name='21560467', description='BfaI', dbxrefs=['Scaffold102974']))]

Let's extract the scores and plot their distribution on a histogram.


In [127]:
scores = [score for score, details in resultssorted]

In [128]:
def plot_score_histogram(scores):
    '''
    Input is a list of scores only (as ints)
    '''
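    # Note: this font path is Mac-specific; adjust or remove it on other systems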
    path = '/Library/Fonts/Microsoft/Arial.ttf'
    prop = matplotlib.font_manager.FontProperties(fname=path)
    matplotlib.rcParams['font.family'] = prop.get_name()
    bins = range(0,106,5)
    figure()
    hist(scores, bins, color="gray")
    tick_params(axis=u'both', labelsize=18)
    #savefig('Scaffold score distribution.pdf', format="pdf")

In [130]:
plot_score_histogram(scores)


So, there are roughly 5,000 guides that are quite non-specific (score <= 4), more than 14,000 guides that score a perfect 100, and a further ~4,000 that score between 95 and 99.
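
These counts can be recomputed directly from the scores list (a quick sanity check, not part of the original pipeline):


In [ ]:
print(sum(1 for s in scores if s <= 4))          # highly non-specific guides
print(sum(1 for s in scores if s == 100))        # perfectly specific guides
print(sum(1 for s in scores if 95 <= s <= 99))   # near-perfect guides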

Finding clusters of high-scoring guides

To make a library of useful guides, we'd like to PCR through continuous clusters of the highest-scoring ones. Our oligonucleotide vendor (IDT) has a minimum order of 288 oligos (3x 96-well plates) on a small and relatively inexpensive scale (5 pmol). To work within this limitation, we'd like to pick out 144 possible regions to PCR-amplify.

If we are only willing to accept guides with a score of 100, we'd predict that our 144 PCR products will be short (there are probably few long spans of perfect-scoring guides). However, if we relax our requirement to >=99, we may get longer PCR products and thus more guides in our library. How does this scale at different cutoffs/thresholds?


In [215]:
def find_clusters_by_cutoff(resultssorted, x):
    starts = []
    ends = []
    previtemgood = 0
    for index, (score, details) in enumerate(resultssorted):
        # len(details) >= 20 avoids guides shorter than 20 bp (from where an
        # enzyme cuts twice in close proximity)
        if score >= x and previtemgood == 0 and len(details) >= 20:
            starts.append((index, score, int(details.name)))
            previtemgood = 1
        elif score >= x and previtemgood == 1 and len(details) >= 20:
            pass
        elif previtemgood == 1:
            previtemgood = 0
            ends.append((index - 1, resultssorted[index - 1][0], int(resultssorted[index - 1][1].name)))
    run_positions = zip(starts, ends)
    goodruns_length = sorted([end - start for (start, i, j), (end, l, m) in run_positions], reverse=True)
    return (goodruns_length, run_positions)

In [216]:
threshold = range(0, 105, 5)
probeyield = []
for item in threshold:
    probeyield.append((item, sum(find_clusters_by_cutoff(resultssorted, item)[0][0:143])))

In [217]:
print(probeyield)


[(0, 22490), (5, 9307), (10, 8216), (15, 7902), (20, 7659), (25, 7348), (30, 7125), (35, 6831), (40, 6457), (45, 6323), (50, 5891), (55, 5035), (60, 4791), (65, 4237), (70, 3748), (75, 3577), (80, 3481), (85, 3432), (90, 3256), (95, 2894), (100, 1719)]

In [148]:
%pylab inline
figure()
plot([b for b, c in probeyield], [c for b, c in probeyield], "o")


Populating the interactive namespace from numpy and matplotlib
Out[148]:
[<matplotlib.lines.Line2D at 0x118c3e050>]

Our "yield" of guides descends steadily from a cutoff of >=5 to a cutoff of >=95, then drops from 2894 guides produced at a cutoff of 95 to 1719 at 100. So, a cutoff of >=95 balances specificity and yield.


In [205]:
threshold = 95
runs = find_clusters_by_cutoff(resultssorted, threshold)[1]
#(countofguides, (startguidenumberfrom5', startscore, startpositionbp), (endguidenumberfrom5', endscore, endpositionbp))
goodruns = sorted([((i_end - i), (i, s, pos), (i_end, s_end, pos_end)) for (i, s, pos), (i_end, s_end, pos_end) in runs], reverse=True)

We next asked what happens if we concentrate the guides into a smaller region. To test this, we cut the input DNA into sections 1/7 of the ~21MB starting length and asked how many guides would be obtained if 144 PCR products were designed within each of those subregions.


In [ ]:
probeyield = []
x = 95
fraction = 7.0
overlap = 2.0
region_to_extract = len(resultssorted)/fraction

for i in [float(item)/overlap for item in range(int(overlap*fraction+2.0))]:
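    # Each iteration scores one window of len(resultssorted)/fraction guides,
    # advancing by half a window per step so that consecutive windows overlap.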
    goodruns = find_clusters_by_cutoff(resultssorted[int(region_to_extract*i):int(region_to_extract*(i+1))], x)[0]
    probeyield.append((i, int(region_to_extract*i), sum(goodruns[0:143])))
    if sum(goodruns[0:143]) == 0:
        break

In [248]:
probeyield


Out[248]:
[(0.0, 0, 1948),
 (0.5, 16251, 1977),
 (1.0, 32502, 2070),
 (1.5, 48754, 1939),
 (2.0, 65005, 1781),
 (2.5, 81257, 1948),
 (3.0, 97508, 1980),
 (3.5, 113760, 1836),
 (4.0, 130011, 1966),
 (4.5, 146262, 2004),
 (5.0, 162514, 1985),
 (5.5, 178765, 1980),
 (6.0, 195017, 2113),
 (6.5, 211268, 1923),
 (7.0, 227520, 0.0)]

The final 1/7 of the scaffold has the densest guide yield.


In [ ]:
# Restrict resultssorted to the 3.4MB region used (18121076 to (21505465+786) = 21506251)
resultssorted = [item for item in resultssorted if 18121076 <= int(item[1].name) <= 21506251]
scores = [score for score, details in resultssorted]

Designing primers

We want to design primers that produce PCR products spanning as many guides as possible. The challenge is to prime in a way that covers all of the high-specificity guides and none of the adjacent low-specificity ones. If a "good" and a "bad" guide are very close together, the constraints on where primers can be placed are tight.


In [ ]:
# Set up the input for primer3:
# Sequence available to PCR:
guide_count = []
amps_in_3MB = []

for index, item in enumerate(goodruns[0:400]):
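    # As used below, item[0] is the pair of guides immediately flanking the run
    # (outside it), item[1] is the number of guides in the run, and item[2] is
    # the list of (score, guide) tuples within the run.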
    left_outside = item[0][0].id[-1]
    left_inside = item[2][0][1].id[-1]
    if left_outside == "F" and left_inside == "R":
        permissible_start = int(item[0][0].name) + 10
        required_start_absolute = int(item[2][0][1].name) + 14
    elif left_outside == "R" and left_inside == "R":
        permissible_start = int(item[0][0].name) + 1
        required_start_absolute = int(item[2][0][1].name) + 14
    elif left_outside == "R" and left_inside == "F":
        permissible_start = int(item[0][0].name) + 1
        required_start_absolute = int(item[2][0][1].name) + 18
    elif left_outside == "F" and left_inside == "F":
        permissible_start = int(item[0][0].name) + 10
        required_start_absolute = int(item[2][0][1].name) + 18
    else:
        print("error on left")

    right_inside = item[2][-1][1].id[-1]
    right_outside = item[0][1].id[-1]
    if right_outside == "F" and right_inside == "R":
        permissible_end = int(item[0][1].name) + 19
        required_end_absolute = int(item[2][-1][1].name) + 2
    elif right_outside == "R" and right_inside == "F":
        permissible_end = int(item[0][1].name) + 10
        required_end_absolute = int(item[2][-1][1].name) + 8
    elif right_outside == "R" and right_inside == "R":
        permissible_end = int(item[0][1].name) + 10
        required_end_absolute = int(item[2][-1][1].name) + 2
    elif right_outside == "F" and right_inside == "F":
        permissible_end = int(item[0][1].name) + 19
        required_end_absolute = int(item[2][-1][1].name) + 8
    else:
        print("error on right")

    amp = longscaffold[0][permissible_start:permissible_end]
    # Bounds that need to be included in the PCR product:
    required_start_relative = required_start_absolute - permissible_start
    required_end_relative = required_end_absolute - permissible_start
    amp.dbxrefs = (required_start_relative, required_end_relative)
    # Carry the amplicon's position and guide count along on the SeqRecord:
    amp.name = str(item[0][0].name)
    amp.id = str(item[0][0].name)
    amp.description = str(item[1])
    amp.seq.alphabet = IUPACAmbiguousDNA()
    if "NNNNN" in amp.seq:  # Exclude amplicons containing runs of Ns
        pass
        #print(amp.name + " contains ns " + str(item[1]))
    else:
        amps_in_3MB.append(amp)
        guide_count.append(item[1])
amps_in_3MB_gen = (i for i in amps_in_3MB)

print(sum(guide_count[0:144]))

In [ ]:
with open("primerlist.txt", "w") as primerlist:
    primerlist.write("Sequence_id\tforward_seq\tforward_start\tforward_length\tforward_tm\tforward_gc\treverse_seq\treverse_start\treverse_length\treverse_tm\treverse_gc\tinput_seq_length\tPCR_product_length\tGuides_Contained\n")
    primerlist.close()

for item in amps_in_3MB:
    current_amp = item
    primerdict = al_primersearch(current_amp)
    al_collect_good_primers(item, primerdict)

This outputs to "primerlist.txt" the final list of primers to PCR from the Xenopus genome, listed in Table 3 (this is tab-separated so can be imported into Excel). After PCR, the products are pooled and subjected to the EATING protocol.